
Conversation

@ikawrakow (Owner)

This PR derives from PR 16847 in mainline.

On my GPU (RTX 4080) it is a very minor improvement over the main branch (~0.5% better TG for GPT-OSS-20B-MXFP4, less for other models). But based on the discussion in the mainline PR, it may lead to larger performance gains on GPUs with low memory bandwidth.

The PR also adds the `-mmvq | --merge-qkv` option (see #878) to llama-bench.
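Assuming the new flag behaves like other llama-bench toggles (a 0/1 value list so both settings are benchmarked in one run — the list syntax and model path below are assumptions, not taken from this PR), a comparison run might look like:

```sh
# Sketch: compare token generation with QKV merging off vs. on.
# The 0,1 list syntax and the model filename are assumptions for illustration.
./llama-bench -m gpt-oss-20b-mxfp4.gguf --merge-qkv 0,1
```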

@ikawrakow ikawrakow merged commit fd3757d into main Oct 31, 2025
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 6, 2025
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 6, 2025
ikawrakow pushed a commit that referenced this pull request Nov 11, 2025
ikawrakow added a commit that referenced this pull request Nov 11, 2025